-
Notifications
You must be signed in to change notification settings - Fork 664
[Feat][RayCluster] Use a new RayClusterReplicaFailure condition to reflect the result of reconcilePods #2259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feat][RayCluster] Use a new RayClusterReplicaFailure condition to reflect the result of reconcilePods #2259
Conversation
1478100 to
42b99fd
Compare
…flect the result of reconcilePods Signed-off-by: Rueian <[email protected]>
…flect the result of reconcilePods Signed-off-by: Rueian <[email protected]>
4d8a1e6 to
41462a9
Compare
| meta.SetStatusCondition(&newInstance.Status.Conditions, metav1.Condition{ | ||
| Type: string(rayv1.RayClusterReplicaFailure), | ||
| Status: metav1.ConditionTrue, | ||
| Reason: "FailedReconcilePods", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failing to reconcile pods is different from replica failure, do we need to make this distinction clearer?
For example, replica failure could mean one of the pods are unhealthy and need to be recreated, but a reconcile error could happen for other reasons before the pod is running
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @andrewsykim,
Possible distinctions are those in the previous draft:
- FailedDeleteAllPods
- FailedDeleteHeadPod
- FailedCreateHeadPod
- FailedDeleteWorkerPod
- FailedCreateWorkerPod
Do they make sense to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but my point is that maybe reconcile error and replica failures are better off as separate conditions instead of different reasons within the same condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We borrow the idea of the replica failures condition from the Kubernetes ReplicaSet controller. It has only two condition reasons:
- FailedCreate
- FailedDelete
And it doesn't reflect the health of pods either. I guess that is because the health of pods is handled by the pod controller already.
The FailedReconcilePods reason in this PR may be oversimplified and exceeds the semantic of ReplicaFailure indeed. I think I can bring the previous draft back by introducing them as individual error instances and wrapping them at the appropriate points. However, those points are still in the reconcilePods function.
var (
failedDeleteAllPods = errstd.New("...")
failedDeleteHeadPod = errstd.New("...")
failedCreateHeadPod = errstd.New("...")
failedDeleteWorkerPod = errstd.New("...")
failedCreateWorkerPod = errstd.New("...")
)
func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv1.RayCluster) error {
...
if err := r.createHeadPod(ctx, *instance); err != nil {
common.FailedClustersCounterInc(instance.Namespace)
return errstd.Join(failedCreateHeadPod, err)
}
...
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var (
failedDeleteAllPods = errstd.New("...")
failedDeleteHeadPod = errstd.New("...")
failedCreateHeadPod = errstd.New("...")
failedDeleteWorkerPod = errstd.New("...")
failedCreateWorkerPod = errstd.New("...")
)
...This makes sense to me.
|
|
||
| func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv1.RayCluster) error { | ||
| if err := r.doReconcilePods(ctx, instance); err != nil { | ||
| return fmt.Errorf("%w: %w", reconcilePodsErr, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use https://pkg.go.dev/errors#Join and Unwrap instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To unwrap a joined error, we need a special trick:
errs := err.(interface{ Unwrap() []error }).Unwrap()| } | ||
|
|
||
| // reconcilePodsErr is a marker used by the calculateStatus() for setting the RayClusterReplicaFailure condition. | ||
| var reconcilePodsErr = errstd.New("reconcile pods error") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to constant.go?
| return true | ||
| } | ||
| if !reflect.DeepEqual(oldStatus.Conditions, newStatus.Conditions) { | ||
| logger.Info("inconsistentRayClusterStatus", "detect inconsistency", fmt.Sprintf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid using fmt.Sprintf. Please use
logger.Info(..., ..., "old conditions", oldStatus.Conditions, "new conditions", newStatus.Conditions)instead. See this doc for more details.
| meta.SetStatusCondition(&newInstance.Status.Conditions, metav1.Condition{ | ||
| Type: string(rayv1.RayClusterReplicaFailure), | ||
| Status: metav1.ConditionTrue, | ||
| Reason: "FailedReconcilePods", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var (
failedDeleteAllPods = errstd.New("...")
failedDeleteHeadPod = errstd.New("...")
failedCreateHeadPod = errstd.New("...")
failedDeleteWorkerPod = errstd.New("...")
failedCreateWorkerPod = errstd.New("...")
)
...This makes sense to me.
Signed-off-by: Rueian <[email protected]>
…flect the result of reconcilePods Signed-off-by: Rueian <[email protected]>
…flect the result of reconcilePods Signed-off-by: Rueian <[email protected]>
| ) | ||
|
|
||
| func RayClusterReplicaFailureReason(err error) string { | ||
| if e, ok := err.(interface{ Unwrap() []error }); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err.(interface{ Unwrap() []error }).Unwrap() is the only special trick to unwrap a joined error.
https://cs.opensource.google/go/go/+/refs/tags/go1.22.5:src/errors/join.go;l=60-62
kevin85421
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just left a nit comment.
| } | ||
| } | ||
|
|
||
| func TestErrRayClusterReplicaFailureReason(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we test the case that the error is not a RayClusterReplicaFailure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I will add a test that the RayClusterReplicaFailureReason() should return an empty string for a random error.
Signed-off-by: Rueian <[email protected]>

This is a revised draft of #2245 based on the review.
In this version, a
RayClusterReplicaFailurecondition will be set whenever thereconcilePods()returns areconcileErr. This is done by thecalculateStatus()based on thereconcileErrsolely. It assumes thereconcilePods()is the last reconciliation function.We also consolidate the fine-grained condition reasons in the previous draft into one
FailedReconcilePodsfor simplicity.Tests are conducted in three parts:
reconcilePods()will return areconcileErr.calculateStatus()will set the condition only when the feature gate is enabled.inconsistentRayClusterStatus()will reporttruewhen conditions are changed.Checks